SIMD array processor.

A single-instruction-multiple-data (SIMD) array processor comprising a multi-dimensional array of processing
elements P(i,j) and control logic for issuing global instructions to said array, is provided with processing
elements which comprise a programmable decoding means for individually decoding a global instruction, the
programmable decoding means of respective processing elements being programmable in response to a global
load instruction from the control logic.

In a particular example of a SIMD array processor as described, the programmable decoder comprises a
programmable look-up table for locally modifying selected bits of a global instruction, to form locally modified
bits, and fixed decoding logic for decoding the bits of the global instruction as received and the locally modified
bits. In this processor, information defining the local modifications is loaded into the look-up table from storage in
response to a global load instruction. In the described example, the programmable decoding means are adapted
to locally modify global information transfer instructions, such that data may be transferred in a plurality of
directions at one time within the array of processors.

A SIMD array processor in accordance with the present invention is particularly suitable for image
processing applications and accordingly may be implemented as part of a display system.




EP 0 314 277 A2



SIMD ARRAY PROCESSOR 



The present invention relates to a single-instruction-multiple-data (SIMD) array processor comprising a
multidimensional array of interconnected processors.

There are essentially two types of array processors: MIMD (multiple-instruction-multiple-data) and
SIMD (single-instruction-multiple-data). In a MIMD array processor each of the processing elements in the
array executes its own unique instruction stream with its own data. This contrasts with an array processor of
the type to which the present invention is directed, that is a SIMD array processor, in that the individual
processing elements operate instead under the control of a common, or global, instruction stream from a
single control unit. As the individual processing elements operate under the control of the common
instruction stream, a SIMD machine is less flexible and can execute a more limited range of
functions in parallel than a MIMD machine. However, the parallel processing elements of a SIMD machine
are typically simpler and more numerous than in a MIMD processor.

Many SIMD array processors consist of a two dimensional array of processing elements, each
processing element being connected to its nearest neighbours to form a so-called NEWS (North, East,
West, South) network. Examples of array processors of this type are the ICL Distributed Array Processor
(DAP), and the "Connection Machine", which are described in "Parallel Computers" by Hockney &
Jesshope, Adam Hilger Ltd. 1981, pp 182-184 and "The Connection Machine" by W. Daniel Hillis, MIT
Press 1986, pp 74-78 respectively. UK-A-1 445 714 is also illustrative of a prior art SIMD array processor.

An example of the lack of flexibility of a SIMD machine arranged as a NEWS network can be seen with
regard to a shift instruction. In a conventional NEWS network all processing elements will receive data
from their neighbour 1 place away in a given direction, e.g. South. The direction of shift for each processing
element in a conventional NEWS network is globally and uniformly determined as a parameter of the global
machine instruction being executed, with the result that all processing elements shift data in the same
direction. A typical instruction would be to shift the data 3 places North. Some machines are able to
selectively enable the processing elements in the NEWS network using a mask function so that only the
processing elements which have been enabled receive the global instructions. Also, in European Patent
Application EP-A-208 457, a processor array is described in which each processing element in the array is
able to select the element from which it takes its input.

The object of the present invention is to provide a SIMD array processor comprising a multi-dimensional
array of processing elements which has an enhanced degree of flexibility to enable the
potential for parallel processing to be better exploited without resorting to the expense and complexity of a
MIMD processor.

In accordance with the present invention there is provided a SIMD array processor comprising a multi-dimensional
array of processing elements and control logic for issuing global instructions to said array,
characterised in that a processing element includes programmable decoding means for the individual
decoding for execution by that processing element of a global instruction, the programmable decoding
means of respective processing elements being programmable in response to a global load instruction from
the control logic.

In a particular embodiment of a SIMD array processor in accordance with the invention, which
embodiment is to be described hereinafter, the programmable decoding means comprises programmable
modifying means for locally modifying selected bits of the global instruction to form locally modified bits
and fixed decoding means for decoding bits of the global instruction as received and said locally modified
bits.

In that particular embodiment, the processing elements in the array are each associated with storage for
control information and data and, in response to a global load instruction, the fixed decoding means in a
processing element causes modification information to be loaded from selected locations in said storage
into the programmable modifying means of that processing element, whereby the programmable modifying
means of respective processing elements may be programmed. Moreover, the processing elements in that
embodiment are each associated with a corresponding block of said storage for control information and data
and means are provided for the control logic to access said storage for storing appropriate modification
information for the programmable decoding means in corresponding locations in each said block of storage.

In the particular embodiment to be described hereinafter the programmable modifying means comprises
a look-up table having a serial write port for receiving said modification information serially from
storage and a parallel read port for receiving said selected bits of a global instruction in parallel.

Also, in the particular SIMD array processor to be described hereinafter the programmable decoding
means are adapted to be programmed to locally modify a global shift instruction, whereby data distributed
throughout the array may be shifted at one time in a plurality of locally determined directions within the
array. The processing elements in this processor are interconnected in a plurality of orthogonal directions.
Each processing element comprises a plurality of output registers and a multi-way input multiplexer such
that simultaneous shift operations can be performed in each of a plurality of orthogonal directions.

The local control of a SIMD array processor in accordance with the present invention gives
rise to a family of algorithms which it has not been possible to perform in parallel on prior SIMD array
processors. Some of these are described later. A SIMD array processor in accordance with the present
invention is particularly suitable for image processing applications and accordingly it may be implemented
as part of a display system.

There follows a description of a specific embodiment of the present invention with reference to the
accompanying drawings, in which:

Figure 1 is a schematic block diagram illustrating the overall structure of a typical SIMD array
processor;

Figure 2 is a schematic block diagram illustrating the principal components of the array controller of
an embodiment of a SIMD array processor in accordance with the present invention;

Figure 3 illustrates the instruction format used in the array controller of Figure 2;

Figure 4 is a schematic block diagram illustrating the principal components of an individual
processing element from the processor array of an embodiment of the present invention;

Figure 5 is a schematic block diagram illustrating, in more detail, the decoder in the processing
element of Figure 4;

Figure 6 is a diagram illustrating an algorithm which can be implemented on a SIMD array processor
in accordance with the present invention;

Figures 7a and 7b are diagrams illustrating a further algorithm which can be implemented on a SIMD
array processor in accordance with the present invention;

Figure 8 is a schematic block diagram illustrating modifications to the processing element of Figure
4; and

Figures 9, 10 and 11 are diagrams illustrating algorithms which can be implemented on a SIMD array
processor in accordance with the present invention, with the modifications to the processing elements
shown in Figure 8.

Figure 1 illustrates a typical structure for a SIMD array processor 10. The processor 10 comprises an
array 12 of processing elements P(i,j), and an array controller 14 for issuing a stream of global instructions
to the processing elements P(i,j). Each of the processing elements operates on a single bit at any one time
and has associated therewith a block of storage (not shown). The processing elements are connected by a
so-called NEWS (North, East, West, South) network to their respective neighbours by bidirectional bit lines.
The processing element P(i,j) is connected to the processing elements P(i-1,j), P(i,j+1), P(i,j-1) and P(i+1,j)
in the Northern, Eastern, Western and Southern directions respectively. The NEWS network is also
connected toroidally at its edges so that the Northern and Southern edges are bidirectionally interconnected
and the Western and Eastern edges are similarly interconnected. In order that data may be input to and
output from the array of processors, a controller-array data bus 26 is connected to the NEWS network. As
shown, it is connected to the East-West boundary of the array. It could equally be connected instead, or
additionally, to the North-South boundary, or indeed to each processing element. It is connected to the
East-West boundary by means of bidirectional tri-state drivers which are connected to the toroidal East-West
NEWS connections. It will be apparent to the skilled person that this is only one of many possible means of
connection of the data bus 26.
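
By way of illustration only, the toroidal NEWS connectivity described above can be sketched in Python as follows. The function name `news_neighbours` and the use of zero-based row and column indices are assumptions made for this sketch; they do not appear in the specification.

```python
def news_neighbours(i, j, n=32):
    """Return the (row, column) indices of the four NEWS neighbours of P(i,j).

    Assumes an n-by-n array with zero-based indices and toroidal
    (wrap-around) connections at all four edges.
    """
    return {
        "N": ((i - 1) % n, j),  # P(i-1,j)
        "E": (i, (j + 1) % n),  # P(i,j+1)
        "W": (i, (j - 1) % n),  # P(i,j-1)
        "S": ((i + 1) % n, j),  # P(i+1,j)
    }
```

For a corner element the wrap-around is visible directly: on a 4 by 4 array, the Northern neighbour of P(0,0) is P(3,0) and its Western neighbour is P(0,3).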

The number of processing elements in the array can be chosen as required. A typical number, as is
used in a specific embodiment of the present invention to be described later, is 32 x 32 = 1024 individual
processing elements. For ease of illustration, however, only 16 individual processing elements are shown.
Also for reasons of ease of illustration, only the principal connections which are necessary for an
understanding of the operation of the processor are indicated in Figure 1. In Figure 1, as in the remainder of
the Figures, a double line connecting functional elements is used to represent a plurality of connection lines
or a bus; a single line indicates a single bit line. The lines may be uni- or bidirectional as appropriate.

The array controller 14 issues instructions in parallel to the processing elements via an instruction bus 18
and issues row select and column select signals via row select lines 20 and column select lines 22
respectively. These instructions cause the processing elements to load data from storage, to process the
data and then to store the data once more in storage.
Each processing element has access to a bit slice of main memory. Logically, therefore, the main
memory of the array processor is separated into 1024 slices for a 1024 processing element array. This
means that up to thirty-two 32-bit words can be transferred in or out of storage at one time. To perform a
read or write operation, the memory is addressed in terms of an index address which is supplied to the
memory address lines via an address bus 24, and a read or write instruction is supplied to each of the
processing elements in parallel. During a read operation, the row and column select signals on the row and
column select lines identify which of the processing elements are to perform the operation. Thus it is
possible, for example, to read a single 32-bit word from memory into the 32 processing elements in a
selected row.
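
The masked read described above may be modelled, purely as a sketch, by the following Python function. The name `read_plane`, the nested-list representation of memory and registers, and the choice of the Q register as the destination are assumptions of this sketch rather than features taken from the specification.

```python
def read_plane(memory, q, index, row_select, col_select):
    """Copy bits at one index address into the registers of the selected elements.

    memory[index][i][j] is the bit held for P(i,j) at that index address;
    q[i][j] models a one-bit register in P(i,j). An element takes part
    only if both its row select and its column select line are active.
    """
    for i, row_on in enumerate(row_select):
        for j, col_on in enumerate(col_select):
            if row_on and col_on:
                q[i][j] = memory[index][i][j]
    return q
```

Selecting a single row and all columns reads one word into the elements of that row and leaves every other element unchanged.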

A host processor 28 is also shown in Figure 1. This processor is used to load microcode programs into
the array controller 14, to exchange data with it and to monitor its status via a host-controller data bus 30
and an address and control bus 31. The host processor can be any suitable general purpose computer
such as a mainframe computer or a personal computer. No further description of the host processor is
necessary as this does not form part of the present invention.

The structure as described above is typical for prior art SIMD array processors. A processor of this type
is described in UK-A-1 445 714.

In the following, a specific embodiment of a SIMD array processor in accordance with the present
invention, which also has the overall structure shown in Figure 1, will be described. It will be apparent from
the following, however, that the present invention is not limited to the structure illustrated in Figure 1. For
example, the array of processors could be organised on a 3-D, 4-D (using clusters), etc. basis rather than on
a 2-D basis. Also, instead of being configured as an item separate from a host processor, a SIMD array
processor in accordance with the present invention may form an integral part of, for example, a display
system such as a workstation with a display adapter. As will be clear from examples of algorithms which are
described hereinafter, a SIMD array processor in accordance with the present invention is particularly
suitable for image processing applications.

Figure 2 illustrates how the array controller shown in Figure 1 is structured in the specific embodiment
of a SIMD array processor in accordance with the present invention. Neither the detailed structure of the
array controller shown in Figure 2, nor specific details of its operation are essential to the present invention.
Consequently, the structure and operation of the controller will only be briefly described in the following.

The array controller 14 comprises a microcode store 32 into which microcode defining the processing
to be performed by the array processor is loaded by the host 28 using the data bus 30 and the address and
control bus 31. Once the operation of the array controller 14 has been initiated by the host 28, the
sequencing of the microcode is controlled by the microcode control unit 34 which is connected to the
microcode store by bus 36. An ALU 38 and register bank 40 are used in the generation of array memory
addresses which are output on the address bus 24, loop counting, jump address calculation and miscellaneous
general purpose register operations. A flag line 39 is provided for conditional branching. A row mask
PLA (Programmed Logic Array) 42 and a column mask PLA 44 are used for decoding row and column
mask codes in a microinstruction being executed to generate signals on individual row select lines 20 and
column select lines 22. Operation codes forming the instructions to the processing elements P(i,j) are fed
onto the instruction bus 18. A data buffer 46 is shown between the host-controller data bus 30 and the
controller-array data bus 26. This allows data from the host which is to be written into the array of
processors to be rapidly down-loaded into the array controller 14. The data can then be loaded, under
control of the microcode, into the array of processors. Similarly, the buffer can be used for transferring data
between the array and the host. For this purpose the buffer is arranged as a bidirectional FIFO buffer under
control of the microcode control unit.

The instruction format used in the specific embodiment of the present invention is illustrated in Figure
3. It should be noted that the format shown in Figure 3 is merely that used in the specific example of a
SIMD array processor described herein. In other embodiments of the invention another format might well be
used depending on the form of the controller for the processor array, the complexity of the individual
processing elements and so on, as will be apparent from the following description to one skilled in the art.

The fields of the instruction which relate to the control of the processor array are the processing
element operation code "PeOP", bits 63 to 56, the row mask code "Maskr", bits 55 to 48, and the column
mask code "Maskc", bits 47 to 40. The "PeOP" field forms the operation code, or instruction, which is
issued globally in parallel to processing elements in the array. The purpose of the row and column masks is
to enable the instruction specified by the "PeOP" code to be executed by selected processing elements
only. This enables memory read operations, inter-processor element shifts and intra-processor element
register operations to be performed by selected processing elements only. The contents of the "Maskr" and
"Maskc" fields are decoded by the row mask PLA 42 and the column mask PLA 44, respectively, for
setting appropriate individual row select lines 20 and column select lines 22.

The remaining fields shown in Figure 3 are all concerned with the sequencing of the array controller and
array memory in a conventional manner. The "Test" field, bits 39 to 36, defines the
instruction flow within the array controller and is fed to the "test" input of the microcode
control unit 34. The field "AluOp", bits 35 to 32, defines the general operation of the controller ALU
38. The source register field, bits 31 to 28, and the field "Regd", bits 27 to 24, are used for
selecting source and destination registers in the controller's register bank 40 and are fed for this purpose to
the R and W control inputs, respectively, of the register bank.

The field "Offset", bits 23 to 0, defines an argument for ALU operation and array memory address
generation and is fed to the input A of the ALU 38.
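
As a minimal sketch of the 64-bit instruction layout described for Figure 3, the following Python fragment extracts each field from an instruction word. The bit positions are those given in the text. The spelling "AluOp" is a reading of the source text, and "RegSrc" is a hypothetical label for the source register field in bits 31 to 28; the function name `unpack` is likewise an assumption of this sketch.

```python
# Field name -> (most significant bit, least significant bit), per Figure 3.
FIELDS = {
    "PeOP":   (63, 56),  # processing element operation code
    "Maskr":  (55, 48),  # row mask code
    "Maskc":  (47, 40),  # column mask code
    "Test":   (39, 36),  # controller instruction flow
    "AluOp":  (35, 32),  # controller ALU operation
    "RegSrc": (31, 28),  # source register select (hypothetical label)
    "Regd":   (27, 24),  # destination register select
    "Offset": (23, 0),   # ALU argument / array memory address
}

def unpack(word):
    """Split a 64-bit instruction word into its named fields."""
    fields = {}
    for name, (msb, lsb) in FIELDS.items():
        width = msb - lsb + 1
        fields[name] = (word >> lsb) & ((1 << width) - 1)
    return fields
```

Only the "PeOP" field is broadcast to the processing elements; the two mask fields reach them indirectly as row and column select signals.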

Figure 4 illustrates the principal components of one of the individual processing elements P(i,j) of the
processor array 12. It should be understood that each processing element operates on a single bit of data
at any one time.

The processing element comprises an ALU 48, which in the specific example of a processing element
shown comprises inputs labelled A, C, Q, M and N and outputs A, C and Q. The outputs A, C and Q are
connected to an A, or result, register 50, a C, or carry, register 52 and a Q, or NEWS output, register 54,
each of which is able to store a single bit of information. The outputs of these registers are connected
to the corresponding A, C and Q inputs to the ALU and also to a multiplexer 56. The multiplexer 56 enables
the output of a selectable one of the A, C and Q registers to be passed to its output 58. The output 58 of
the multiplexer 56 is connected to the M input of the ALU and also to the bidirectional data port 59 of the slice
of memory 16(i,j) associated with the processing element P(i,j).

Each processing element is associated with a slice, or block, of memory 16(i,j) one bit wide. Although
this slice or block of memory is logically included within the processing element, it may in fact be physically
separate therefrom. As each of the processing elements has a similar block of memory, the 32 x 32 blocks
of memory of the array can be thought of as an array memory comprising a plurality of planes, each of
which comprises 32 32-bit words. Each plane comprises a bit from each of the blocks of memory at a
corresponding index address. By supplying a single index address to the array memory via the address bus
24, one of the planes of bits can be accessed.

The output of the Q register also forms the NEWS output 60(i,j) of the processing element P(i,j), which is
connected to the adjacent processing elements in the North (P(i-1,j)), East (P(i,j+1)), West (P(i,j-1)) and
South (P(i+1,j)) directions. Data to be shifted into the processing element P(i,j) from an adjacent element in
the NEWS network is selected by means of an input multiplexer 62 which is connected to the NEWS
outputs 60(i-1,j), 60(i,j+1), 60(i,j-1) and 60(i+1,j) of the adjacent processing elements in the North, East,
West and South directions respectively. The output of the input multiplexer 62 is connected to the N input
of the ALU.

The operation of the processing element is controlled by instructions, or operation codes, received from
the array controller 14 over the instruction bus 18. The operation codes "PeOP" from the array controller
are received in parallel on the instruction bus 18 at the decoder 64 in each of the
processing elements. As in prior art SIMD array processors, the status of the row and column select lines
connected to the processing element in question will determine whether the instruction is performed by
that processing element or not. The decoder in each processing element is connected to the column and
row select lines appropriate for the position of the processing element in the array. For the processing
element P(i,j) in the ith row and the jth column this will be the ith row select line 20i and jth column select
line 22j. When both the row and the column select lines to a particular decoder 64 are selected, the
decoder will decode the received operation code and thereby cause the processor to carry out the specified
operation by issuing control signals over control buses 66, 68 and 70 to the input multiplexer 62, the ALU
48 and the multiplexer 56.

Two basic types of operation are performed in the ALU 48. The first type of operation is a routing
operation in which a bit of input data is simply passed from the input of the ALU to the output. For example,
a bit can be passed from a selected one of the NEWS inputs to the multiplexer 62 via that input
multiplexer 62, the input N of the ALU and the Q output of the ALU to the Q register, which is the
NEWS output register. From there the information is output onto the NEWS network. Similarly, a bit of data
can be routed from a location in memory 16 specified by an address on the memory address bus 24 via
the bidirectional data port 59 of the memory, the M input and the Q output of the ALU to the Q register. A
second basic type of operation is an arithmetic one. In the processing element shown, the result register 50
and the carry register 52 are used principally for such operations. The actual operations which can be
performed in the ALU will depend on the internal structure of the unit. This will not be described in detail
as it is not essential to an understanding of the present invention. Typically the ALU is implemented in a
conventional manner as will be evident to one skilled in the art.

In prior art SIMD array processors the decoder has been hard-wired, usually in the form of a plurality of
hard-wired logic gates. The decoder in a SIMD array processor in accordance with the present invention is,
in contrast, programmable. In a particular embodiment of the invention, the decoder is programmable by the
provision of a look-up table which operates as an instruction modifier. Further, in this embodiment of the
invention, only selected bits of the operation code input to the decoder 64 are modified by the look-up
table.

Figure 5 illustrates the decoder 64 of this specific embodiment of the present invention in more detail. It
comprises a first part 72 in the form of conventional fixed decoder logic such as hard-wired gates and a
second, programmable part in the form of a look-up table (LUT) 74. The bits of the operation code on lines
18(i) to 18(vi) and the row and column select lines are input directly to the PLA as usual, but two of the bits
18(v) and 18(vi) of the operation code are also used to address the look-up table in parallel. The two-bit
output of the look-up table location accessed by these bits forms modified operation bit lines 18(vii) and
18(viii) which are also input to the fixed decoder logic 72. The fixed decoder logic 72 logically combines the
input data on lines 18(i)-(viii) to form the output control information on the control buses 66, 68 and 70.
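
The role of the look-up table 74 can be pictured with the following Python sketch of a four-word, two-bit table addressed by operation code bits 18(v) and 18(vi). The class name `ProgrammableDecoder`, the choice of bit 18(v) as the more significant address bit and the ordering of the output bits are assumptions made for illustration; the specification does not state them.

```python
class ProgrammableDecoder:
    """Models only the look-up table part of the decoder of Figure 5."""

    def __init__(self):
        # Four words of two bits each, held here as integers 0..3.
        self.lut = [0, 0, 0, 0]

    def modified_bits(self, bit_v, bit_vi):
        """Return the bits on lines 18(vii) and 18(viii) for given 18(v), 18(vi)."""
        address = (bit_v << 1) | bit_vi  # assumption: 18(v) is the high address bit
        word = self.lut[address]
        return (word >> 1) & 1, word & 1
```

With the table loaded as `[0, 1, 2, 3]` the modified bits simply repeat the received bits; writing a different word into one location changes how that one processing element interprets the same global operation code.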

The look-up table shown in Figure 5 comprises four words of two bits each. These eight bits of data are
loaded in series over a data line 76 from memory 16(i,j) (see also Figure 4) in response to a global "load
look-up table" instruction from the array controller. As the look-up tables of the processing elements will
contain indeterminate information before being initialised, the "load look-up table" instruction only uses
unmodified bits 18(i) to (vi) of the operation code as received, that is the bits that do not go through the
look-up table. The decoder in each of the processing elements produces the control signals on the buses
66, 68, 70 and on the control line 78 internal to the decoder when this instruction is received, irrespective of
the value of the lines 18(v) and 18(vi). The control line 78 is a write enable line for the look-up table.
Appropriate control, or modification, data for the look-up table of each of the respective processing
elements would have been previously loaded via the data bus 26 and the Q register 54 into corresponding
locations in the slice of memory 16(i,j) associated with each processing element P(i,j) so that the control
data may be accessed and read into the look-up tables of each of the processing elements using global
instructions and memory addresses. The storing of the control information is carried out by operating the
array in a prior art manner using global unmodified instructions.
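
The serial loading of the eight table bits can be sketched as follows. The function name `load_lut_serial` and the bit ordering (word 0 first, more significant bit first) are assumptions of this sketch; the specification says only that the eight bits are loaded in series over the data line 76.

```python
def load_lut_serial(bit_stream):
    """Assemble four 2-bit look-up table words from eight serially received bits.

    Assumed ordering: word 0 arrives first, and within each word the
    more significant bit arrives first.
    """
    bits = list(bit_stream)
    if len(bits) != 8:
        raise ValueError("the look-up table holds exactly eight bits")
    return [(bits[2 * w] << 1) | bits[2 * w + 1] for w in range(4)]
```

Because each processing element reads its eight bits from its own slice of memory, a single global load sequence can leave a different table in every element.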

In a SIMD array processor according to the present invention there are essentially two sorts of global
instructions defined by the "PeOP" operation codes. These are global instructions which cannot be
modified locally and global instructions which can be modified locally. The first sort are global instructions
which are used for initially loading data into the processor array, for shifting the data through the array, for
storing that data in the array memory and for subsequently loading modification data into the look-up tables.
All other instructions can, in principle, be modified locally, but they can only be used when appropriate
modification information has been loaded into the look-up tables. The fixed decoder logic 72 logically
combines the input data on lines 18(i) to 18(vi) (i.e. the unmodified instruction bits) in such a manner that it
recognises whether the instruction being decoded is locally modifiable and whether the input data on lines
18(vii) and 18(viii) (i.e. the modified instruction bits) are to be used to determine the operation actually
performed by the processing element.

In the present example of a SIMD array processor in accordance with the present invention, the
programmable decoder is used to specify a different shift direction for different processing elements
despite the restriction of the global instructions. The actual direction in which data is shifted is the result of
the selection of one of the NEWS inputs to the input multiplexer 62. Given that two bits of the instruction
code (e.g. the bits 18(v) and 18(vi)) are used to specify the global direction of shift in the NEWS network, a
look-up table for local modification of those two bits means that it is possible to individually specify a local
direction of shift in each processing element in response to a given global shift instruction.
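
A locally modified shift of this kind can be simulated in a few lines of Python. In this sketch each processing element holds a four-entry table that maps the two-bit global direction code to a local direction code, and the local code selects which neighbour's output is taken as input. The numeric encoding of the directions (0 = North, 1 = East, 2 = West, 3 = South), the names `OFFSETS` and `shift`, and the zero-based indexing are all assumptions made for illustration.

```python
# Local direction code -> (row offset, column offset) of the neighbour read from.
OFFSETS = {0: (-1, 0), 1: (0, 1), 2: (0, -1), 3: (1, 0)}  # N, E, W, S (assumed encoding)

def shift(q, luts, global_code):
    """Perform one shift step on a toroidal n-by-n array of Q register values.

    q[i][j] is the value held by P(i,j); luts[i][j][c] is the local
    direction code that P(i,j) substitutes for global direction code c.
    """
    n = len(q)
    new_q = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            di, dj = OFFSETS[luts[i][j][global_code]]
            new_q[i][j] = q[(i + di) % n][(j + dj) % n]
    return new_q
```

If every table is the identity `[0, 1, 2, 3]`, the step behaves as a conventional global shift. Changing a single entry in one element's table makes that element take its input from a different neighbour while all the others follow the global direction.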

The preparation of the SIMD array processor shown for performing an algorithm which exploits the local
modification of shift instructions can be summarised as follows.

Successive 32-bit words of data are read into the Western edge of the array of processors via the
bidirectional tri-state drivers in the toroidal East-West NEWS connections and are shifted across the array
using global unmodified shift East instructions. When the first word of data has migrated across the array of
processors, respective bits of the first 32 words of data are written into corresponding memory locations in
the blocks of memory associated with each processing element using a global unmodified write instruction.
These steps are repeated until all the necessary information has been loaded into memory. During or after
the above sequence, modification data is read into the look-up tables of the processing elements using a
global unmodified "load look-up table" instruction. Once this step has been performed, the array of
processors can be used to perform algorithms using locally modifiable instructions.
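
The first part of the preparation sequence, in which words enter at the Western edge and migrate Eastwards, can be sketched as below. The function name `load_plane`, the representation of each word as a list of bits with one bit per row, and the simplification that the bus data replaces the wrap-around input of the Western column are assumptions of this sketch.

```python
def load_plane(words, n):
    """Shift n words of n bits each into an n-by-n array from the Western edge.

    Each step models one global unmodified shift East: every row moves one
    place East and the Western column takes the next word from the data bus.
    Returns the register contents that would then be written to memory.
    """
    q = [[0] * n for _ in range(n)]
    for word in words:
        for i in range(n):
            q[i] = [word[i]] + q[i][:-1]
    return q
```

After n words have been shifted in, the first word occupies the Eastern column and the last word the Western column, at which point one global write instruction stores the whole plane.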
As mentioned above, a SIMD array processor in accordance with the present invention is particularly
suitable for image processing applications. In order to expedite the input of image data to and the output
of image data from the array of processors, a high bandwidth data bus could additionally be connected to
the array of processors of a SIMD array processor in accordance with the present invention. Data could be
input from a video camera and output to a video store or video display device via
the high bandwidth bus instead of over the controller-array data bus 26. The high bandwidth data bus
could be connected to the array in a similar manner to the controller-array data bus. Alternatively, a separate
input to the input multiplexer 62 and a separate video output register (not shown) from
each of the processing elements such as that shown in Figure 4 could be provided for connection to the high
bandwidth data bus. The provision of a high bandwidth data bus is not, however, essential to the present
invention.

Two algorithms will now be described which are of particular application in image processing and
which illustrate the use of an embodiment of a SIMD array processor in accordance with the present invention
with the example of a programmable decoder unit shown in Figure 5.
The first algorithm takes data held in the processor array and rotates it by -90 degrees. For a four by

The algorithm essentially comprises a series of shift operations which allow data to be moved around
the array of processing elements on one of a set of closed, non-overlapping "paths" or "loops" such that,
starting at any processing element, exactly M steps along the path leads to the correct processing element
mapping. The North West quadrant of one possible way of setting out the set of loops for a 32 by 32
processor array is illustrated in Figure 6. The remaining quadrants can be inferred by rotational symmetry.

It will be noticed that some loops are shorter than others and some have a clockwise and some an
anti-clockwise direction of shift as indicated by the arrows. However, the common factor for each of the loops is
that data shifted 33 times along the loop on which it is located will end up in the corresponding
position in the adjacent quadrant. In other words, in 33 steps, the whole array is rotated by 90 degrees.

By allowing the individual specification of data shifts between processing elements, it is possible to
transfer data in different directions within the network in one instruction cycle. It must be remembered that
in prior SIMD array processors, it was only possible to shift in one direction within the array at any one time
because of the constraint of having global shift operations. The provision of a programmable decoder in a
SIMD array processor according to the invention for local instruction modification means the data can be
shifted in different directions despite the constraint of the global instructions. For the algorithm shown in
Figure 6, the modification data contained in the location accessed in the look-up tables will vary from
processor element to processor element to define the loops shown.
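
By way of illustration only, the effect of locally modified shift instructions can be sketched in Python. The sketch assumes a 2-bit direction field in the global shift instruction and one four-entry look-up table per processing element; the names `shift`, `identity_luts` and `ring_luts_4x4`, and the 4 by 4 loop layout, are hypothetical and are not the loops of Figure 6.

```python
# Source directions a processing element can select (illustrative encoding).
N, E, S, W = 0, 1, 2, 3
# (row, column) offset of the neighbour lying in each direction; row 0 is North.
OFFSET = {N: (-1, 0), E: (0, 1), S: (1, 0), W: (0, -1)}


def identity_luts(n):
    """Look-up tables that leave the direction field unmodified."""
    return [[[N, E, S, W] for _ in range(n)] for _ in range(n)]


def ring_luts_4x4():
    """Hypothetical tables for a 4 by 4 array: when the global field is N,
    the outer ring circulates clockwise and the inner ring anticlockwise."""
    inner = {(1, 1): E, (2, 1): N, (2, 2): W, (1, 2): S}
    luts = identity_luts(4)
    for r in range(4):
        for c in range(4):
            if (r, c) in inner:
                src = inner[(r, c)]
            elif r == 0 and c > 0:
                src = W          # top row: data moves East
            elif c == 3:
                src = N          # right column: data moves South
            elif r == 3:
                src = E          # bottom row: data moves West
            else:
                src = S          # left column: data moves North
            luts[r][c][N] = src  # only the entry for field N is modified
    return luts


def shift(grid, luts, global_field):
    """One global shift: each element looks up its own source direction and
    takes the data item held by that neighbour (toroidal wrap-around)."""
    n = len(grid)
    out = [[None] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            dr, dc = OFFSET[luts[r][c][global_field]]
            out[r][c] = grid[(r + dr) % n][(c + dc) % n]
    return out
```

With these tables a single global instruction moves data in all four directions at once, and twelve repetitions return every item to its starting element because both loop lengths (12 and 4) divide twelve.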
The second algorithm concerns a reflection in the X axis. 

The second algorithm takes data held in the processor array and reflects it in the "X-axis". For a four
by four array of bits the matrix of data before and after reflection will look as indicated below:
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For ease of illustration the algorithm is shown for an 8 by 8 processor array in Figures 7a and 7b. It can
easily be developed for a 32 by 32 processor array. The algorithm runs in two steps. The first runs for 4
cycles and has the NEWS setting shown in Figure 7a. The second step runs in one cycle and is simply a
global shift West. This has the NEWS setting shown in Figure 7b. The algorithm takes 1 + n/2 cycles to
implement X-axis reflection on an n by n array (n even).

For the algorithm shown it is only necessary to set the look-up tables once at the beginning of the
reflection operation. Four first global shift instructions are issued which are locally modified to give the
pattern shown in Figure 7a, which cause any bit to be moved four steps along the path in which it is located.
Then a single second global shift instruction (a global shift West) is issued which does not need to be
locally modified and gives the NEWS pattern shown in Figure 7b. This causes any bit to be moved one
step Westwards. It can be seen that each bit ends up in a position which forms the reflection in the X axis
with respect to its original position. The change in the shift directions of Figure 7b with respect to Figure 7a
is caused merely by the use of two different global shift instructions. In the case of Figure 7a the global
instruction is modified by modifier bits which vary from processor element to processor element. In the
case of Figure 7b, the global instruction is not modified.
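
The two-step scheme can be sketched in Python as follows. The sketch is illustrative only: Figure 7a is not reproduced here, so the serpentine pattern used for the locally modified shifts is an assumed pattern chosen because it satisfies the description (n/2 locally modified shifts followed by one global shift West), and it further assumes that n is a multiple of 4, that row 0 is North and that reflection in the X axis exchanges row r with row n-1-r. The function names are hypothetical.

```python
def locally_modified_step(grid):
    """One locally modified shift. The array is split into two bands of n/2
    rows (a centre band and a band that wraps round the torus). In even
    columns data flows South and turns East at the end of the band; in odd
    columns it flows North and turns East at the start of the band."""
    n = len(grid)
    h = n // 2
    out = [[None] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            band_start = n // 4 if n // 4 <= r < 3 * n // 4 else 3 * n // 4
            i = (r - band_start) % n          # position within the band
            if c % 2 == 0:
                tr, tc = (r, c + 1) if i == h - 1 else (r + 1, c)
            else:
                tr, tc = (r, c + 1) if i == 0 else (r - 1, c)
            # The pattern is a permutation, so every element receives
            # exactly one item.
            out[tr % n][tc % n] = grid[r][c]
    return out


def global_shift_west(grid):
    """Unmodified global shift: every item moves one element West."""
    n = len(grid)
    return [[grid[r][(c + 1) % n] for c in range(n)] for r in range(n)]


def reflect_x(grid):
    """Return the reflected array and the number of shift cycles used."""
    n = len(grid)
    if n % 4 != 0:
        raise ValueError("this sketch assumes n is a multiple of 4")
    cycles = 0
    for _ in range(n // 2):
        grid = locally_modified_step(grid)
        cycles += 1
    grid = global_shift_west(grid)
    cycles += 1
    return grid, cycles
```

In this sketch each item moved n/2 steps along its serpentine path arrives in the mirror-image row, one column East of where it started, and the final global shift West restores the column.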

Figure 8 illustrates modifications to a processor element which further enhance the flexibility of a SIMD 
array processor in accordance with the present invention. 

In a typical operating cycle in a SIMD array processor a processing element in the array selects data
from a single neighbour in the NEWS network. This does not, however, make optimal use of the network
connections because only one of the input connections to each cell is used in any one shift cycle. It would
appear that 75% of the NEWS connections are idle. In practice, however, the NEWS connections are
bidirectional, and as one of the "input" connections for a given processing element is in fact used for the
output from that element, only 50% of the network is in fact idle. Nevertheless, even this 50% represents an
underutilisation of the network. The principal modification to the processing elements shown in Figure 8 is
to provide two NEWS output registers Qns 54ns and Qew 54ew. The provision of these two registers
provides the basis for allowing shift operations in two directions per cycle, e.g. one item North and one item
East. This represents more efficient use of the NEWS wiring. In addition to the above modifications, some
additional modifications of the processing elements are necessary.

As shown in Figure 8, the input multiplexer 62' is a multi-way multiplexer which separately selects two
of the NEWS inputs to the processing element P'(i,j) at one time and supplies them to respective inputs
Nns and New to the ALU 48'. In addition, a multiplexer function is provided within the ALU 48' to select
between the outputs of the Qns and the Qew registers. The Qns register outputs data to the North and
South neighbours, and the Qew register passes data to the East and West neighbours. Each of the Q
registers may sample data from any of the four NEWS inputs via the input multiplexer 62' and the ALU 48'.
A typical cycle might consist of Qns taking data from the West, and Qew taking data from the South. Two
such cycles would cause the two sets of data held in the Qns and Qew registers each to move diagonally
North-East one position. With only one output register Q, four cycles would be required to achieve these
shifts. In this way it is possible for a processing element to shift two bits of data simultaneously.
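
The typical cycle just described can be sketched in Python. The sketch is illustrative only and assumes, following the description of Figure 8, that a North or South neighbour drives the shared link from its Qns register and an East or West neighbour drives it from its Qew register; the function names and the toroidal wrap-around are assumptions of the sketch.

```python
N, E, S, W = "N", "E", "S", "W"
# (row, column) offset of the neighbour lying in each direction; row 0 is North.
OFFSET = {N: (-1, 0), E: (0, 1), S: (1, 0), W: (0, -1)}


def read_neighbour(qns, qew, r, c, direction):
    """Value presented on the NEWS input of element (r, c) from `direction`."""
    n = len(qns)
    dr, dc = OFFSET[direction]
    rr, cc = (r + dr) % n, (c + dc) % n
    # North/South neighbours output from Qns, East/West neighbours from Qew.
    return qns[rr][cc] if direction in (N, S) else qew[rr][cc]


def dual_shift(qns, qew, src_ns, src_ew):
    """One cycle in which every element loads Qns from `src_ns` and Qew
    from `src_ew` at the same time."""
    n = len(qns)
    new_ns = [[read_neighbour(qns, qew, r, c, src_ns) for c in range(n)]
              for r in range(n)]
    new_ew = [[read_neighbour(qns, qew, r, c, src_ew) for c in range(n)]
              for r in range(n)]
    return new_ns, new_ew


def moved(plane, dr, dc):
    """Copy of `plane` with every item moved by (dr, dc) on the torus."""
    n = len(plane)
    return [[plane[(r - dr) % n][(c - dc) % n] for c in range(n)]
            for r in range(n)]
```

Under these assumptions, two cycles with Qns taking data from the West and Qew taking data from the South move both planes one position North-East; after a single cycle the two planes have simply exchanged registers.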

Further details of the modifications to these components or of the changes to the control lines 68, 68,
70' and to the logic in the decoder 64' need not be given here as they are merely a matter of routine to
implement. Also, as will be apparent to one skilled in the art, other modifications of the processing
element are possible which will support the duplication of the Q register.

Although the resulting processing element of Figure 8 is more complex than that shown in Figure 4, it
does increase further the flexibility and efficiency of the NEWS network. This is achieved, moreover, without
additional NEWS connections for each processing element if the input and output lines to processing
elements are bidirectional, as the input and output NEWS connections of the processing elements are
shared. Also, it is possible for a processing element as shown in Figure 8 to process two bits of data
simultaneously.

In the following, three algorithms which exploit the modifications in Figure 8 are described with
reference to Figures 9, 10 and 11. For ease of illustration, the algorithms are shown for an 8 by 8 array
only, with each processing element P'(i,j) represented, as before, by a small circle. When interpreting the
Figures, it should be remembered that the array is toroidally connected.

The first of these algorithms is illustrated in Figure 9. This figure represents a North-West shift over the
whole array. In Figure 9 the arrow labelled 86 at the processing element 82 represents the use of the Qns
output register 54ns in Figure 8 for receiving data via the input multiplexer 62' from the Eastern NEWS
connection to that processing element. Similarly, the arrow labelled 84 at that processing element in Figure
9 represents the use of the Qew output register 54ew in Figure 8 for receiving data from the Southern
NEWS connection via the input multiplexer 62' to that processing element. Thus two transfers are being
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performed in the processing element at that node.

For this algorithm the global shift instruction is decoded in the same way for each of the processors.
The diagonal line through the circles representing the processing elements indicates how the flow of
information along different paths is separated. When reading the Figure it is useful to think of the diagonal
lines as reflecting the data flow. The thicker lines joining certain processing elements indicate one
such path. It can be seen how an item of data is shifted in a north-westerly direction in two steps. As
each processing element handles two bits at once, only one step per North-West bit shift is required.
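
The path followed by a single item of data under this uniform decoding can be traced with a short Python sketch. It is illustrative only and assumes that the Eastern input carries the Eastern neighbour's Qew register and the Southern input carries the Southern neighbour's Qns register; the function names are hypothetical.

```python
def north_west_cycle(qns, qew):
    """One cycle of the uniform decoding: every element loads Qns from its
    East input and Qew from its South input (toroidal wrap-around)."""
    n = len(qns)
    new_qns = [[qew[r][(c + 1) % n] for c in range(n)] for r in range(n)]
    new_qew = [[qns[(r + 1) % n][c] for c in range(n)] for r in range(n)]
    return new_qns, new_qew


def locate(qns, qew, item):
    """Return (row, column, register name) of `item`, or None if absent."""
    for name, plane in (("Qns", qns), ("Qew", qew)):
        for r, row in enumerate(plane):
            for c, value in enumerate(row):
                if value == item:
                    return (r, c, name)
    return None


def trace(n, start, cycles):
    """Place one item in the Qns register of `start` and follow it."""
    qns = [[None] * n for _ in range(n)]
    qew = [[None] * n for _ in range(n)]
    qns[start[0]][start[1]] = "x"
    path = [locate(qns, qew, "x")]
    for _ in range(cycles):
        qns, qew = north_west_cycle(qns, qew)
        path.append(locate(qns, qew, "x"))
    return path
```

In this sketch the item first moves North into a Qew register and then West into a Qns register, so that it alternates between the two registers while travelling along a diagonal.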

The second algorithm is illustrated in Figure 10. For this algorithm, which transposes the array, the
processing elements are programmed to decode a global shift instruction in two different ways. Processing
elements such as the processing elements 90 and 92, which are represented by a circle crossed by an
obtuse diagonal line, decode a global shift instruction in the same way as the element 82 and
so data is handled as represented by the arrows 84 and 86. Processing elements such as the element 94,
which are represented by a simple circle, are programmed such that data is selected from the South and
East NEWS inputs by the input multiplexer 62' and loaded into the Qns and Qew registers respectively.

Two data paths, one 96 represented by heavy lines and one 98 represented by dotted lines, are shown
illustrating how the paths change direction at the processing elements shown with a diagonal line and cross
over at the other processing elements.
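
A Python sketch of the two decodings is given below. It is illustrative only: it assumes that the elements drawn with a diagonal line are those on the leading diagonal (row equal to column, with row 0 to the North), that North or South neighbours output from Qns and East or West neighbours from Qew, and that the array is toroidally connected. The function names are hypothetical.

```python
def transpose_cycle(qns, qew):
    """One global shift decoded in two ways, depending on the element."""
    n = len(qns)
    new_qns = [[None] * n for _ in range(n)]
    new_qew = [[None] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            from_south = qns[(r + 1) % n][c]  # South neighbour outputs Qns
            from_east = qew[r][(c + 1) % n]   # East neighbour outputs Qew
            if r == c:
                # Diagonal element: Qns from East, Qew from South, so the
                # two data paths change direction here.
                new_qns[r][c], new_qew[r][c] = from_east, from_south
            else:
                # Plain element: Qns from South, Qew from East, so the two
                # data paths cross over without turning.
                new_qns[r][c], new_qew[r][c] = from_south, from_east
    return new_qns, new_qew


def transpose_planes(a, b):
    """Load plane `a` into Qns and plane `b` into Qew, then run n cycles."""
    qns, qew = a, b
    for _ in range(len(a)):
        qns, qew = transpose_cycle(qns, qew)
    return qns, qew
```

Under these assumptions every item travels exactly n steps, turning once at the diagonal, so that after n cycles the Qew registers hold the transpose of the plane originally in Qns and the Qns registers hold the transpose of the plane originally in Qew.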

The third algorithm, which performs a rotation by 180°, is illustrated in Figure 11. As will be apparent
on studying this Figure, and in particular the two data paths represented by the heavy line 116 and the
dashed line 118, the processing elements are programmed to decode a global shift instruction in two
different ways in each quadrant, making a total of eight ways in all. These are set out in the following table.



Quadrant   Processing Element Representation      Data Source For   Data Source For
                                                  Qns Register      Qew Register

NW         Circle (eg 100)                        South             West
NW         Circle and acute diagonal (eg 102)     West              South
NE         Circle (eg 104)                        North             West
NE         Circle and obtuse diagonal (eg 106)    West              North
SE         Circle (eg 108)                        North             East
SE         Circle and acute diagonal (eg 110)     East              North
SW         Circle (eg 112)                        South             East
SW         Circle and obtuse diagonal (eg 114)    East              South



It can be seen from the data paths 116 and 118 that a data bit can be rotated by 180° within an 8 by 8
processor array (eg from element 114 to element 106) in eight shifts or steps (ie along path 118). As each
processing element handles two bits simultaneously, the average number of steps per 180° rotation is only
four. This algorithm, like the others shown in Figures 9 and 10, can easily be generalised to an n by n array.

A SIMD array processor in accordance with the present invention with possible
modifications thereto has been described herein. It will be apparent to the skilled person, however, that
many other modifications and alternatives are possible within the scope of the appended claims. For
example, although the look-up table is only described for modifying two bits of the instruction code, it
will be apparent that a look-up table can be provided for modifying a different number of bits and different
types of instructions (ie not just shift instructions). In an alternative embodiment, the row and column select
lines could also be wired to form part of the input to the look-up table so that the select signals on those
lines could also be used for specifying local modifications to the global instruction through the use of the
look-up table. The programmable decoder is described herein as comprising a first part in the form of
conventional fixed decoder logic such as hard-wired gates and a second, programmable part in the form of
a look-up table. It will be apparent to one skilled in the art, however, that alternative programmable decoder
means, in which part of the array is fixed and part is programmable during processing, can be used instead.
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Claims 

1. A SIMD array processor comprising a multi-dimensional array (12) of processing elements (P) and
control logic for issuing global instructions to said array, characterised in that a processing element (P(i,j))
includes programmable decoding means (64) for the individual decoding for execution by that processing
element of a global instruction, the programmable decoding means of respective processing elements being
programmable in response to a global load instruction from the control logic.

2. A SIMD array processor as claimed in claim 1 wherein the programmable decoding means
comprises programmable modifying means (74) for locally modifying selected bits of the global instruction
to form locally modified bits and fixed decoding means (72) for decoding bits of the global instruction as
received and said locally modified bits.

3. A SIMD array processor as claimed in claim 2 wherein processing elements in the array are each
associated with storage (16) for control information and/or data and wherein, in response to said global
load instruction, the fixed decoding means (72) in a processing element causes modification information to
be loaded from selected locations in said storage into the programmable modifying means (74) of that
processing element, whereby the programmable modifying means of respective processing elements may
be programmed.

4. A SIMD array processor as claimed in claim 3 wherein processing elements (P) in the array are each
associated with a corresponding block of said storage (16) for control information and/or data and
wherein means are provided for the control logic (14) to access said storage (16) for storing appropriate
modification information for the programmable decoding means (74) in corresponding locations in each said
block of storage.

5. A SIMD array processor as claimed in claim 2 or in any claim dependent thereon, wherein the 
programmable modifying means (74) comprises a look-up table. 

6. A SIMD array processor as claimed in claim 5 wherein the look-up table (74) comprises a serial write
port (76) for receiving said modification information serially from storage and a parallel read port for
receiving said selected bits (18(v), 18(vi)) of a global instruction in parallel.

7. A SIMD array processor as claimed in any of the preceding claims in which the programmable
decoding means (74) is adapted to be programmed to locally modify a global shift instruction, whereby data
distributed throughout the array (12) may be shifted at one time in a plurality of locally determined
directions within the array.

8. A SIMD array processor as claimed in any of the preceding claims wherein the processing elements 
(P) are interconnected in a plurality of orthogonal directions. 

9. A SIMD array processor as claimed in claim 8 wherein each processing element comprises a
plurality of output registers (54) and an input multiplexer (62) such that multiple simultaneous data shift
operations can be performed in a plurality of orthogonal directions at one time within a processing element.

10. A display system comprising a SIMD array processor as claimed in any of the preceding claims. 
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